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A GEOMETRIC INTERPRETATION OF THE PERMUTATION 
p- VALUE AND ITS APPLICATION IN eQTL STUDIES 
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University of North Carolina and University of North Carolina 

Permutation p-values have been widely used to assess the sig- 
nificance of linkage or association in genetic studies. However, the 
application in large-scale studies is hindered by a heavy computa- 
tional burden. We propose a geometric interpretation of permutation 
p-values, and based on this geometric interpretation, we develop an 
efficient permutation p-value estimation method in the context of 
regression with binary predictors. An application to a study of gene 
expression quantitative trait loci (eQTL) shows that our method pro- 
vides reliable estimates of permutation p-values while requiring less 
than 5% of the computational time compared with direct permuta- 
tions. In fact, our method takes a constant time to estimate permuta- 
tion p-values, no matter how small the p- value. Our method enables a 
study of the relationship between nominal p-values and permutation 
p-values in a wide range, and provides a geometric perspective on the 
effective number of independent tests. 

1. Introduction. With the advance of genotyping techniques, high den- 
sity SNP (single nucleotide polymorphism) arrays are often used in current 
genetic studies. In such situations, test statistics (e.g., LOD scores or p- 
values) can be evaluated directly at each of the SNPs in order to map the 
quantitative/qualitative trait loci. We focus on such marker-based study 
in this paper. Given one trait and p markers (e.g., SNPs), in order to as- 
sess the statistical significance of the most extreme test statistic, multiple 
tests across the p markers need to be taken into account. In other words, 
we seek to evaluate the first step family- wise error rate (FWER), or the 
"experiment- wise threshold" [Churchill and Doerge (1994)]. Because nearby 
markers often share similar genotype profiles, the simple Bonferroni cor- 
rection is highly conservative. In contrast, the correlation structure among 
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genotype profiles is preserved across permutations and thus is incorporated 
into permutation p-value estimation. Therefore, the permutation p-value is 
less conservative and has been widely used in genetic studies. Ideally, the 
true permutation p-value can be calculated by enumerating all the pos- 
sible permutations, calculating the proportion of the permutations where 
more extreme test statistics are observed. In each permutation, the trait is 
permuted, or equivalently, the genotype profiles of all the markers are per- 
muted simultaneously. However, enumeration of the possible permutations is 
often computationally infeasible. Permutation p-values are often estimated 
by randomly permuting the trait a large number of times, which can still be 
computationally intensive. For example, to accurately estimate a permuta- 
tion p- value of 0.01, as many as 1000 permutations may be needed [Barnard 
(1963), Marriott (1979)]. 

In studies of gene expression quantitative trait loci (eQTL), efficient per- 
mutation p-value estimation methods become even more important, be- 
cause in addition to the multiple tests across genetic markers, multiple tests 
across tens of thousands of gene expression traits need to be considered 
[Kendzioriski et al. (2006), Kendziorski and Wang (2006)]. One solution is 
a two-step procedure, which concerns the most significant eQTL for each 
expression trait. First, the permutation p- value for the most significant link- 
age/association of each expression trait is obtained, which takes account of 
the multiple tests across the genotype profiles. Second, a permutation p- 
value threshold is chosen based on a false discovery rate (FDR) [Benjamini 
and Hochberg (1995), Efron et al. (2001), Storey (2003)]. This latter step 
takes account of the multiple tests across the expression traits. Following 
this approach, the computational demand increases dramatically, not only 
because there are a large number of expression traits and genetic mark- 
ers, but also because stringent permutation p- value threshold, and therefore 
more permutations must be applied to achieve the desired FDR. In order to 
alleviate the computational burden of permutation tests, many eQTL stud- 
ies have merged the test statistics from all the permuted gene expression 
traits to form a common null distribution, which, as suggested by empirical 
studies, may not be appropriate [Carlborg et al. (2005)]. In this paper we 
estimate the permutation p-value for each gene expression trait separately. 

In order to avoid the large number of permutations, some computation- 
ally efficient alternatives have been proposed. Nyholt (2004) proposed to 
estimate the effective number of independent genotype profiles (hence the 
effective number of independent tests) by eigen-value decomposition of the 
correlation matrix of all the observed genotype profiles. Empirical results 
have shown that, while Nyholt's procedure can provide an approximation 
of the permutation p-value, it is not a replacement for permutation testing 
[Salyakina et al. (2005)]. In this study we also demonstrate that the effective 
number of independent tests is related to the significance level. 
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Some test statistics (e.g., score test statistics) from multiple tests asymp- 
totically follow a multivariate normal distribution, and adjusted p- values can 
be directly calculated [Conneely and Boehnke (2007)]. However, currently 
at most 1000 tests can be handled simultaneously, due to the limitation 
of multivariate normal integration [GenZ (2000)]. Lin (2005) has proposed 
to estimate the significance of test statistics by simulating them from the 
asymptotic distribution under the null hypothesis, while preserving the co- 
variance structure. This approach can handle a larger number of simultane- 
ous tests efficiently, but it has not been scaled up to hundreds of thousands 
of tests, and its stability and appropriateness of asymptotics have not been 
validated in this context. 

In this paper we present a geometric interpretation of permutation p- 
values and a permutation p- value estimation method based on this geomet- 
ric interpretation. Our estimation method does not rely on any asymptotic 
property and, thus, it can be applied when the sample size is small, or when 
the distribution of the test statistic is unknown. The computational cost of 
our method is constant, regardless of the significance level. Therefore, we 
can estimate very small permutation p-values, for example, 10~ 8 or less, 
while estimation by direct permutations or even by simulation of test statis- 
tics may not be computationally feasible. In principle, our approach can be 
applied to the data of association studies as well as linkage studies. How- 
ever, the high correlation of test statistics in nearby genomic regions plays 
a key role in our approach. Thus, the application to linkage data is more 
straightforward. We restrict our discussion to binary genotype data, which 
only take two values. Such data include many important classes of experi- 
ments: study of haploid organisms, backcross populations and recombinant 
inbred strains. This restriction simplifies the computation so that an efficient 
permutation p-value estimation algorithm can be developed. However, the 
general concept of our method is applicable to any categorical or numerical 
genotype data. 

The remainder of this paper is organized as follows. In Section 2 we first 
present the problem setup, followed by an intuitive interpretation of our 
method, and finally we describe the more complicated algebraic details. In 
Section 3 we validate our method by comparing the estimated permutation 
p- values with the direct values obtained by a large number of permutations. 
We also compare the permutation p-values with the nominal p-values to 
assess the effective number of independent tests. Finally, we discuss the 
limitations of our method, and suggest possible improvements. 

2. Methods. 

2.1. Notation and problem setup. Suppose there are p markers geno- 
typed in n individuals. The trait of interest is a vector across the n indi- 
viduals, denoted by y = (yi, . . . ,y n ), where yi is the trait value of the ith 
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individual. The genotype profile of each marker is also a vector across the n 
individuals. Throughout this paper, we use the term "genotype profile" to 
denote the genotype profile of one marker, instead of the genotype profile 
of one individual. Thus, a genotype profile is a point in the n-dimensional 
space. We denote the entire genotype space as f2, which includes 2 n distinct 
genotype profiles. 

As mentioned in the Introduction, we restrict our discussion to binary 
genotype data, which only take two values. Without loss of generality, we 
assume the two values are and 1. Let m\ = (mu, . . . , mi n ) and n%2 = 
(to2i, . . . ,m2n) be two genotype profiles. We measure the distance between 
mi and m,2 by Manhattan distance, that is, 



We employ Manhattan distance because it is easy to compute and it has an 
intuitive explanation: the number of individuals with different genotypes. 
In our algorithm the distance measure is only used to group genotype pro- 
files according to their distances to a point in the genotype space. There- 
fore, any distance measure that is a monotone transformation of Manhat- 
tan distance leads to the same grouping of the genotype profiles, hence the 
same estimate of the permutation p-value. For binary genotype data, any 
distance measure (^27=1 \ mii ~ m 2i\ T1 Y 2 (Vri,T2 > 0) is a monotone trans- 
formation of Manhattan distance. We note, however, this is not true for 
categorical genotype data with more than two levels. For example, sup- 
pose the genotype of a biallelic marker is coded by the number of minor 
allele. Consider three biallelic markers with genotypes measured in three 
individuals: mi = (0,0,0), ni2 = (0,2,0) and 7713 = (1,1,1). By Manhattan 
distance, dM(^i>"^2) = 2 < dyi(jni,fnz) = 3. However, by Euclidean dis- 
tance, d(m\,m2) = 2 > d(mi,m^) = y/3. Therefore, different distance mea- 
sures may not be equivalent and the optimal distance measure should be 
the one that is best correlated with the test-statistic. 

In the following discussions we assume one test statistic has been com- 
puted for each marker (locus). Our method can estimate permutation p- 
value for any test statistic. For the simplicity of presentation, throughout 
this paper we assume the test statistic is the nominal p-value. 

2.2. A geometric interpretation of permutation p-values. One fundamen- 
tal concept of our method is a so-called "significance set." Let a be a genome- 
wide threshold used for the collection of nominal p-values from all the mark- 
ers. A significance set $(a) denotes, for a fixed trait of interest, the set of 
possible genotype profiles (whether or not actually observed) with nominal 
p-values no larger than a. Similarly, we denote such genotype profiles in the 
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ith. permutation as <3?j(a). Since permuting the trait is equivalent to permut- 
ing all the genotype profiles simultaneously, <3?j(a) is simply a permutation 
of $(a). 

Whether any nominal p-value no larger than a is observed in the ith 
permutation is equivalent to whether <3?j(a) captures at least one observed 
genotype profile. With this concept of a significance set, we can introduce 
the geometric interpretation of the permutation p-value: 

The permutation p-value for nominal p-value a is, by definition, the pro- 
portion of permutations where at least one nominal p-value is no larger than 
a. This is equivalent to the proportion of {&i(a)} that capture at least one 
observed genotype profile. Therefore, the permutation p-value depends on the 
distribution of the genotype profiles within &i(a) and the distribution of the 
observed genotype profiles in the entire genotype space. 

Intuitively, the permutation p-value depends on the trait, the observed 
genotype profiles and the nominal p-value cutoff a. In our geometric inter- 
pretation we summarize these inputs by two distributions: the distribution 
of all the observed genotype profiles in the entire genotype space, and the 
distribution of the genotype profiles in <3?j(a), which include the information 
from the trait and the nominal p-value cutoff a. 

We first consider the genotype profiles in <&i(a). For any reasonably small 
a (e.g., a = 0.01), all the genotype profiles in $i(a) should be correlated, 
since they are all correlated with the trait of interest. Therefore, we can 
imagine these genotype profiles in &i(a) are "close" to each other in the 
genotype space and form a cluster (or two clusters if we separately consider 
the genotype profiles positively or negatively correlated with the trait). In 
later discussions we show that under some conditions, the shape of one clus- 
ter is approximately a hypersphere in the genotype space. Then, in order 
to characterize $«(«), we need only know the center and radius of the cor- 
responding hyperspheres. In more general situations where $>i(a) cannot 
be approximated by hyperspheres, we can still define its center and further 
characterize the genotype profiles in <3?j(a) by a probability distribution: 
P(r,a), which is the probability a genotype profile belongs to ^(a), given 
its distance to the center of <&j(a) is r (Figure 1A). We summarize the 
information across all the $j(a)'s to estimate permutation p-values. Since 
{^(a)} is a one-to-one mapping of all the permutations, we actually es- 
timate permutation p-values by acquiring all the permutations. Therefore, 
the computational cost is constant regardless of a. We show this seemingly 
impossible task is actually doable. First, because permutation preserves dis- 
tances among genotype profiles, the probability distributions from all the 
significance sets {^(a), &i(a)} are the same. Therefore, we only need to 
calculate it once. Second, the remaining task is to count the qualifying sig- 
nificance sets, which can be calculated efficiently using combinations, with 
some approximations. 
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Fig. 1. A two-dimensional schematic representation of the geometric interpretation of 
permutation p-value, reflecting genotype profiles that actually reside in 2" -space. (A) In the 
general situation, the function P(r, a), shown in grayscale, decreases with distance from the 
center of a significance set. Under hypersphere assumption, P(r, a) is either or 1, thus, 
it can be illustrated by a hypershpere surrounding the center of the significance set. (B) The 
space occupied by the series of markers is calculated serially. Denote the neighborhood re- 
gion of the hth marker as B^. Then the contribution of the hth marker to ty(r a ) is approx- 
imated by Bh\(Bh fl Bh-i) , where '\" indicate set difference. As indicated by the darker- 
shade, this serial counting approximation is not exact when (B^ OBfc) ^ (B^ nB^-i), for 
any k < h — 1. Note the dot in (A) is the center of a significance set, while the dots in (B) 
are the observed marker genotype profiles. 

The distribution of the observed genotype profiles in the genotype space 
depends on the number of the observed genotype profiles and their correla- 
tion structure. Since <&i(a) may be thought of as randomly located in the 
genotype space in each permutation, on average, the chance that $i(a) cap- 
tures at least one observed genotype profile depends on how much "space" 
the observed genotype profiles occupy. We argue that such space include the 
observed genotype profiles as well as their neighborhood regions. How to 
define the neighborhood regions? We first consider the conceptually simple 
situation that &i(a) forms a hypersphere of radius r a , where the subscript 
a indicates that r a is a function of a. Then <&i(a) captures an observed 
genotype profile mi if its center is within the hypersphere centered at mi 
with radius r a . Therefore, the neighborhood region of mi is a hypersphere 
of radius r a . We take the union of the neighborhood regions of all the ob- 
served genotype profiles and denote it by ^(r a ) (Figure IB). Then we can 
evaluate permutation p-values by calculating the proportion of significance 
sets with their centers within Vl/(r a ). In the general situation where the hy- 
persphere assumption does not hold, a significance set <3?i(a) is characterized 
by a probability distribution P(r,a). Instead of counting a significance set 
by or 1, we count the probability it captures at least one observed geno- 
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type profile. We will discuss this estimation method more rigorously in the 
following sections. 

Before presenting the algebraic details, we emphasize that our method 
uses the entire set of the observed genotypes profiles simultaneously. Specif- 
ically, the correlation structure of all the genotype profiles is incorporated 
into the construction of ^f(r a ). The higher the correlations between the ob- 
served genotype profiles, the more the corresponding neighborhood regions 
overlap (Figure 1). This in turn produces a smaller space f (r Q ), and thus 
a smaller permutation p- value. In the extreme case when all the observed 
genotype profiles are the same, there is effectively only one test and the 
permutation p-value should be close to the nominal p-value. 

2.3. From significance set to best partition. Explicitly recording all the 
elements in all the significance sets is not computationally feasible. We in- 
stead characterize each significance set by a best partition, which can be 
understood as the center of the significance set, and a probability distribu- 
tion: the probability that one genotype profile belongs to the significance 
set, given its distance to the best partition. 

We first define best partition. The best partition for <J>(a) [or 3>j(a)] is a 
partition of the samples that is most significantly associated with the trait 
(or the ith permutation of the trait). For a binary trait, the trait itself 
provides the best partition. For a quantitative trait, we generate the best 
partition by assigning the smallest t-values to one phenotype class and the 
other (n — t)-values to another phenotype class. We typically use t = n/2 as 
a robust choice. The robustness of this choice is illustrated by the empirical 
evidence in the Supplementary Materials [Sun and Wright (2009)]. Given 
t, we refer to all the possible best partitions (partitions that divide the n 
individuals into two groups of size t and n — t) as desired partitions. The 
total number of distinct desired partitions, denoted by N p , is 




if n/2, 
if t = n/2. 



When t = n/2, there are (?) ways to choose t individuals, but two such 
choices correspond to one partition, that is why we need the factor 1/2. For 
a binary trait, the desired partitions and the significance sets have one-to-one 
correspondence and, thus, N p is the total number of significance sets (or the 
total number of permutations). For a quantitative trait, N p is much smaller 
than the total number of significance sets. In fact, each desired partition 
corresponds to t\(n — t)\ distinct significance sets (or permutations). Since 
we restrict our study for binary genotype, this definition of best partition can 
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be understood as the projection of the trait into the genotype space. This 
projection is necessary to utilize the geometric interpretation of permutation 
p- value. Note the best partition does not replace the trait since the trait data 
is still used in calculating P(r, a). The projection of trait into genotype space 
is less straightforward when the genotype has three or more levels, though 
it is still feasible. Further theoretical and empirical studies are needed for 
such genotype data. 

Next, we study the probability that one genotype profile belongs to a 
significance set given its distance to the best partition of the significance 
set. Each desired partition, denoted as DPj, has perfect correspondence 
with two genotype profiles, depending on whether the first t-values are or 
1. We denote these two genotype profiles as m® and mj, respectively. The 
distance between one genotype profile mi and one desired partition DPj is 
defined as 

dM(mi,DPj)= min {du(mi, m?)}. 

a=0,l 

Suppose DPj is the best partition of the significance set $i(a). In general, 
the smaller the distance from a genotype profile to DPj, the greater the 
chance it falls into $i(a). Thus, the genotype profiles in <3?i(a) form two 
clusters, centered on m® and mj, respectively. The probability distribution 
we are interested in is 

Pr(mi G *i(a)|Vmi G n,d M (mi,DPj) =r). 

This probability certainly depends on the trait y. However, because all of our 
inference is conducted on y, we have suppressed y in the notation. A similar 
probability distribution can be defined for the significance set <3?(a). Because 
the permutation-based mapping <£(a) — > &i(a) preserves distances, the dis- 
tributions for <£(a) and &i(a) are the same and, thus, we need only quantify 
the distribution for $(a). We denote the best partition of the unpermuted 
trait y as DP y , and denote the two genotype profiles corresponding to DP y 
as m® and m y , then we define the distribution as follows: 

(2.2) P(r, a) = Pr(mi G $(a)|Vmi G ft, du(mi,DP y ) = r). 
Let 

(2.3) P(niy,r,a) = Pr(mi G $(a)|Vmi G Q,d M (mi,m£) = r), 
where a = 0, 1. We have the following conclusion. 

Proposition 1. P(r,a) = P(m°,r, a) = P(m y ,r,a) for any r <n/2. 

The proof is in the Supplementary Materials [Sun and Wright (2009)]. 
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By Proposition 1, in order to estimate P(r,a), we can simply estimate 
P(my,r,a). Specifically we first randomly generate H genotype profiles 
{rrih'-h = l,...,H} so that dy[{Tnh, rriy) = r. To generate rrih, we flip the 
genotype of m® for r randomly chosen individuals. Then P(r, a) is esti- 
mated by the proportion of {m^} that yield nominal p- values no larger than 
a. 

In summary, we characterize a significance set by the correspond- 

ing best partition and the probability distribution P(r,a). All the distinct 
best partitions are collectively referred to as desired partitions. This char- 
acterization of significance sets has two advantages. First, the probability 
distribution P(r, a) is the same across all the significance sets, so we need 
only calculate it once. This is because the probability distribution relies on 
distance measure, which is preserved across significance sets (permutations). 
Second, for a quantitative trait, one desired partition corresponds to a large 
number of significance sets; therefore, we significantly reduce the dimension 
of the problem by considering desired partitions instead of significance sets. 

2.4. Estimating permutation p-values under a hypersphere assumption. 
By the definition of a significance set, we can calculate the permutation p- 
value by counting the number of significance sets that capture at least one 
observed genotype profile. However, it is still computationally infeasible to 
examine all significance sets. Therefore, in the previous section we discuss 
how to summarize the significance sets by desired partitions and a common 
probability distribution. In this and the next sections, we study how to 
estimate permutation p-values by "counting" desired partitions. 

To better explain the technical details, we begin with a simplified situa- 
tion, by assuming there is an r a such that P(r, a) = 1 if r < r a and P(r, a) 
= otherwise. This is equivalent to assuming <I>(a) or <3?i(a) occupies two 
hyperspheres with radius r a . This hypersphere assumption turns out to be 
a reasonable approximation for a balanced binary trait (see Supplementary 
Materials [Sun and Wright (2009)]). 

Let {m Dj fc,l < k < p} be the observed p genotype profiles. We formally 
define the space occupied by the observed genotype profiles and their neigh- 
borhood regions as 

#(r a ) = \ mi :m 1 G tt, min {du(mx, m ,k)} < r a \, 

v. l<k<P ' 

that is, all the possible genotype profiles within a fixed distance r a from at 
least one of the observed genotype profiles. We have the following conclusion 
under the hypersphere assumption. 

Proposition 2. Consider a significance set $i(a) occupying two hy- 
perspheres centered at m® and mj, respectively, with radius r a . ^(a) cor- 
responds to one permutation of the trait. The minimum nominal p-value of 
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The proof is in the Supplementary Materials [Sun and Wright (2009)]. 

Based on Proposition 2, we can calculate the permutation p-value by 
counting the number of significance sets with at least one of its centers be- 
longing to ^(r a ). Note under this hypersphere assumption, for any fixed a 
(hence fixed r a ), the significance sets are completely determined by the cen- 
ters of the corresponding hyperspheres. Thus, there is a one-to-one mapping 
between significance sets and their centers, the desired partitions. Counting 
significance sets is equivalent to counting desired partitions. Therefore, we 
can estimate the permutation p-value by counting the number of desired 
partitions. Specifically, let the distances from all the observed genotype pro- 
files to DPj, sorted in ascending order, be (rji, . . . , rj P ). Then under the 
hypersphere assumption, the permutation p-value for significance level a is 



where N p is the total number of desired partitions, and C(r a ) = \{DPj : Tj\ < 
r a }\ is the number of desired partitions within a fixed distance r a from at 
least one of the observed genotype profiles. The calculation of C{r a ) will be 
discussed in the next section. 

We note that the hypersphere assumption is not perfect even for the 
balanced binary trait. We employ the hypersphere assumption to give a 
more intuitive explanation of our method. In the actual implementation 
of our method, even for a balanced binary trait, we still use the general 
approach to estimate permutation p- values, as described in the next section. 

2.5. Estimating permutation p-values in general situations. In general 
situations where the hypersphere assumption does not hold, we estimate 
the permutation p-value by 



where Pr(DPj,a) is the probability that the minimum nominal p-value < 
a given DPj is the best partition. Equation (2.5) is a natural extension 
of equation (2.4) by replacing the counts with the summation of probabil- 
ities. It is worth noting that in the previous section, one desired partition 
corresponds to one significance set given the hypersphere assumption. How- 
ever, in general situations, one desired partition may correspond to many 
significance sets. Therefore, Pr(DPj,a) is the average probability that the 
minimum nominal p-value < a for all the significance sets centered at DPj . 
Taking averages does not introduce any bias to permutation p-value esti- 
mation, because permutation p-value is itself an average. Here we just take 



(2.4) 



HDPj-.rj^r^l/Np^Cir^/Nj 



(2.5) 
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the average in two steps. First, we average across all the significance sets 
(or permutations) corresponding to the same desired partition to estimate 
Pr(DPj, a). Second, we average across desired partitions. 

Let all the desired partitions whose distances to an observed genotype 
profile m 0) k are no larger than r be Bk(r), that is, 

B fe (r) = {DPj : d M (m ,k, DPj) < r}, 

where 1 < k < p. Assume the observed genotype profiles {m 0i fc} are ordered 
by the chromosomal locations of the corresponding markers. We employ the 
following two approximations to estimate ^ - Pr(DPj, a): 

1. shortest distance approximation: 

Pv(DP„ a )^P(r n , a ), 

2. serial counting approximation: 

p p 
C{r) « Cu(r) = \B h {r)\ - £ \B h (r) D B h -i(r)|, 

h=l h=2 

where C(r) has been defined in equation (2.4). 

Proposition 3. As long as a is reasonably small, for example, a < 
0.05, there exist r^ < r\j , such that P(r,a) = 1, if r < r^; P(r,a) = 0, if 
r >rjj. Given the shortest distance and the serial counting approximations, 

J2P<DP j} a)^Yl P ( r n^) 

j j 

(2.6) 

ru—l 

~Cu(r L )+ Yl [P(r,a)(Cu(r)-Cu(r-l))]. 

r=r L +l 

When a is extremely small, for example, a = 10~ 20 , it is possible rj_ = 0. 
We define 0/(0) = to incorporate this situation into equation (2.6). 

In the Supplementary Materials [Sun and Wright (2009)], we present the 
derivation of Proposition 3, as well as Propositions 4 and 5 that provide the 
algorithms to calculate \Bh(r)\ and \Bh(r) n Bh-i(r)\ , respectively. There- 
fore, by Propositions 3-5, we can estimate the permutation p- value by equa- 
tion (2.5). 

The rationale of shortest distance approximation is as follows. If the space 
occupied by a significance set is approximately two hyperspheres, this ap- 
proximation is exact. Otherwise, if a is small, which is the situation where 
direct permutation is computationally unfavorable, this approximation still 
tends to be accurate. This is because when a is smaller, the genotype profiles 
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within the significance set are more similar and, hence, the significance set is 
better approximated by two hyperspheres. In Section 3 we report extensive 
simulations to evaluate this approximation. 

The serial counting approximation can be justified by the property of 
genotype profiles from linkage data, and (with less accuracy) in some kinds 
of association data. In linkage studies, the similarity between genotype pro- 
files is closely related to the physical distances, with conditional indepen- 
dence of genotypes between loci given the genotype at an intermediate locus. 
Therefore, the majority of the points in Bh(r) n Bh-k{ r ) (2 <k <h — 1) 
already included in Bh(r) n Bh_\{r) (Figure IB) and, thus, 

B h (r) n( |J B fc (r)J « B h (r) D B ft _i(r). 

\<k<h-l ' 

Then, we have 

C(r) = J2\Bk(r)\-J2 



arc 



fe=l 
p 



h=2 
V 



B h {r) n 



U B k {r)\ 
Kk<h-1 ' 



£|B fc (r)|-£|B A (r)nB*_i(r)|. 



fe=i 



h=2 



Our method has been implemented in an R package named permute. t, 
which can be downloaded from http: / /www. bios. unc.edu/~wsun/software. htm. 



3. Results. 



3.1. Data. We analyzed an eQTL data set of 112 yeast segregants gen- 
erated from two parent strains [Brem and Kruglyak (2005), Brem et al. 
(2005)]. Expression levels of 6229 genes and genotypes of 2956 SNPs were 
measured in each of the segregants. Yeast is a haploid organism and, thus, 
the genotype profile of each marker is a binary vector of O's and l's, indi- 
cating the parental strain from which the allele is inherited. We dropped 15 
SNPs that had more than 10% missing values, and then imputed the missing 
values in the remaining SNPs using the function fill.geno in R/qtl [Broman 
et al. (2003)]. Finally, we combined the SNPs that have the same genotype 
profiles, resulting in 1017 distinct genotype profiles. 3 As expected, genotype 



3 Most SNPs sharing the same genotype profiles are adjacent to each other, although 
there are 10 exceptions in which the SNPs with identical profiles are separated by a few 
other SNPs. In all the 10 exceptions, the gaps between the identical SNPs are less than 
10 kb. We recorded the position of each combined genotype profile as the average of the 
corresponding SNPs' positions. 
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profiles between chromosomes have little correlation (Figure 2 in the Sup- 
plementary Materials [Sun and Wright (2009)]), while the correlations of 
genotype profiles within one chromosome are closely related to their phys- 
ical proximity (Figure 3 in the Supplementary Materials [Sun and Wright 
(2009)]). 

3.2. Evaluation of the shortest distance approximation. We evaluate the 
shortest distance approximation Pr(DPj,a) ~ P(rj\,a) in this section. Be- 
cause the permutation p-value is actually estimated by the average of 
Pr (DP j, a) [equation (2.5)], it is sufficient to study the average of Pr(DPj, a) 
across all the DPj's having the same rj\. Specifically, we simulated 50 de- 
sired partitions {DPj,j = 1, . . . , 50} such that, for each DPj, Tj\ = r. Sup- 
pose DPj divides the n individuals into two groups of size t and n — t; then 
DPj is consistent with t\(n — t)\ permutations of the trait. We randomly 
sampled 1000 such permutations to estimate Pt(DPj, a). We then took the 
average of these 50 Pr(DPj, a)'s, denoted it as p(r), and compared it with 
P(r,a). 

We randomly selected 88 gene expression traits. For each gene expression 
trait, we chose a to be the smallest nominal p-value (from t-tests) across all 
the 1,107 genotype profiles. We first estimated P(r,a) and p(r), and then 
examined the ratio P(r,a)/ p(r) at three distances r^, i = 1,2,3, where = 
argmin r {|P(r, a) — 0.25i|}, that is, the approximate 1st quartile, median and 
3rd quartile of P(r, a) when P(r, a) is between and 1 (Figure 2). For the 
genes with larger nominal p- values, P(r, a) / p(r) can be as small as 0.4. Thus, 
the shortest distance approximation is inaccurate. We suggest estimating the 
permutation p- values for the genes with larger nominal p- values by a small 
number of direct permutations, although, in practice, such nonsignificant 
genes may be of little interest. After excluding genes with nominal p- values 
larger than 2 x 10~ 4 , on average, P(r,a)/p(r) is 0.80, 0.88, 0.95 for the 
1st, 2nd and 3rd quartile respectively. We chose the threshold 2 x 10~ 4 
because it approximately corresponds to permutation p-value 0.05 ~ 0.10 
(see Section 3.4. Comparing permutation p- value and nominal p- value). It is 
worth emphasizing that when we estimate permutation p- values, we average 
across DPj's. In many cases, P(rji,a) = or 1 and, thus, Pr(DPj,a) = 
P(rji,a). Therefore, after taking the average across DPj's, the effects of 
those cases with small P(r,a)/ p(r) will be minimized. 

3.3. Permutation p-value estimation for a balanced binary trait — evaluation 
of the serial counting approximation. Using the genotype data from the 
yeast eQTL data set, we performed a genome-wide scan of a simulated bal- 
anced binary trait, with 56 0's and 56 l's. The standard chi-square statistic 
was used to quantify the linkages. As we discussed before, for a balanced 
binary trait, the space occupied by a significance set is approximately two 
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Fig. 2. Evaluation of the shortest distance approximation using 88 randomly selected 
gene expression traits. For each gene expression trait, the ratio P(r,a)/p(r) is plotted at 
three r's, which are approximately the 1st quartile, median and 3rd quartile of P(r,ct) when 
P(r,a) is between and 1. The vertical broken line indicates the nominal p-value 2 x 1CP 4 , 
which corresponds to genome-wide permutation p-value 0.05 ~ 0.10. 

hyperspheres, and the shortest distance approximation is justified. This con- 
clusion can also be validated empirically by examining P(r, a). As shown in 
Table 3 of the Supplementary Materials [Sun and Wright (2009)], for each 
a, there is an r a , such that P(r, a) = 1 if r < r a , and P(r, a) ~ if r > r a . 
From the sharpness of the boundary we can see that a significance set indeed 
can be well approximated by two hyperspheres. Given that the shortest dis- 
tance approximation is justified, we can evaluate the accuracy of the serial 
counting approximation by examining the accuracy of permutation p-value 
estimates. 

The accuracy of the serial counting approximation relies on the assump- 
tion that the adjacent genotype profiles are more similar than the distant 
ones. We dramatically violate this assumption by randomly ordering the 
SNPs in the yeast eQTL data. As shown in Table 1, the permutation p- 
value estimates from the original genotype data are close to the permutation 
p-values estimated by direct permutations, whereas the estimates from the 
location-perturbed genotype data are systematically biased. 

3.4. Permutation p-value estimation for quantitative traits. We randomly 
selected 500 gene expression traits to evaluate our permutation p-value es- 
timation method in a systematic manner. We used t-tests to evaluate the 
linkages between gene expression traits and binary markers. For each gene 
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Table 1 

Comparison of permutation p-value estimates for a balanced binary trait. Values at the 
column of "Permutation p-value" are estimated via 500,000 permutations. Values at the 
columns 'Permutation p-value estimate I/H" are estimated by our method before and 
after perturbing the locations of the SNPs 



Nominal 


Permutation 


Permutation 


Permutation 


p-value 


p-value 


p-value 


p-value 


cutoff 




estimate I 


estimate II 


1(T 3 


0.19 


0.21 


0.41 


1(T 4 


0.02 


0.021 


0.039 


1(T 5 


2.0 x 10~ 3 


1.9 x 10~ 3 


2.9 x 10 -3 


1(T 6 


2.4 x 10 -4 


2.2 x 10 -4 


3.1 x 10 -4 



expression trait, we first identified the genome- wide smallest p-value, and 
then estimated the corresponding permutation p-value by either our method 
or by direct permutations [Figure 3(a)]. For those relatively larger permu- 
tation p- values (>0.1), the estimates from our method tend to be inflated. 
Some of them are even greater than 1. This is because the serial counting 
approximation is too loose for larger permutation p-values, due to the fact 



O) CM 
CL I 



to _ 



(a) 




1 
















1 1 
-5 -4 


i i i i 

-3 -2 -1 



OJ CM 



(b) 




e 




•* o 




I I 
-5 -4 


I I I I 
-3 -2 -1 



Iog10(pp) 



Iog10(pp) 



Fig. 3. Comparison of permutation p-values estimated by our method (denoted as pe) 
or by direct permutations (denoted as pp) for 500 randomly selected gene expression traits 
(each gene corresponds to one point in the plot), (a) Using the original genotype data, (b) 
Using the location-perturbed genotype data. Each gene expression trait is permuted up to 
500,000 times to estimate pp. Thus, the smallest permutation p-value is 2 x 10 -6 , and we 
have more confidence for those permutation p-values bigger than 2 x 10~ 4 (indicated by 
the vertical line). The degree of closeness of the points to the solid line (y = x) indicates 
the degree of consistency of the two methods. The two broken lines along the solid line 
are y — x ± log 10 (2) respectively, which, in the original p-value scale, are pe= 0.5pp and 
pe — 2pp, respectively. 
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that each significance set occupies a relatively large space. Nevertheless, 
the two estimation methods give consistent results for those permutation 
p- values smaller than 0.1. We also estimated the permutation p- values after 
perturbing the order of the SNPs [Figure 3(b)]. As expected, the permuta- 
tion p- value estimates are inflated. 

The advantage of our method is the improved computational efficiency. 
The computational burden of our method is constant no matter how small 
the permutation p- value is. To make a fair comparison, both our estima- 
tion method and direct permutation were implemented in C. In addition, 
for direct permutations, we carried out different number of permutations 
for different gene expression traits so that a large number of permutations 
were performed only if they were needed. Specifically, we permuted a gene 
expression trait 100, 1000, 5000, 10,000, 50,000 and 100,000 times if we had 
99.99% confidence that the permutation p- value of this gene was bigger than 
0.1, 0.05, 0.02, 0.01, 0.002 and 0.001, respectively. Otherwise we permuted 
500,000 times. It took 79 hours to run all the permutations. If we ran at 
most 100,000 permutations, it took about 20 hours. In contrast, our method 
only took 46 minutes. All the computation was done in a computing server 
of Dual Xenon 2.4 Ghz. 

3.5. Comparing permutation p-values and nominal p-values. The results 
we will report in this section are the property of permutation p-values, in- 
stead of an artifact of our estimation method. However, using direct per- 
mutation, it is infeasible to estimate a very small permutation p-value, for 
example, 10 -8 or less. In contrast, our estimation method can accurately 
estimate such permutation p-values efficiently 4 This enables a study of the 
relationship between permutation p-values and nominal p-values. Such a re- 
lationship can provide important guidance for the sample size or power of a 
new study. 

Let x and y be log 10 (nominal p- value) and log 10 (permutation p- value 
estimate) respectively. We compared x and y across the randomly selected 
500 gene expression traits used in the previous section [Figure 4(a)] and 
found an approximate linear relation. 

We employed median regression (R function rq) to capture the linear 
pattern [Figure 4(b)]. 5 If the nominal p- value was too large or too small, 



4 Our method cannot estimate those extremely small permutation p-values such as 
10~ 20 reliably. This is simply because only a few genotype profiles can yield such signif- 
icant results even in the whole genotype space. Nevertheless, those results correspond to 
unambiguously significant findings even after Bonferroni correction. Therefore, permuta- 
tion may not be needed. See the Supplementary Materials [Sun and Wright (2009)] for 
more details. 

5 Most genes whose fitted values differ from the observed values more than 2- folds are 
below the linear patterns. These genes often have more outliers than other genes, which 
may violate the t-test assumptions and bring bias to nominal p-values. 
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Fig. 4. Comparison of permutation p-value estimates and nominal p -values, (a) Scatter 
plot of permutation p-value estimates vs. nominal p-value in loglO scale for the 500 gene 
expression traits. Those unreliable permutation p-value estimates are indicated by "x." See 
footnote 2 for explanation, (b) Scatter plot for 483 gene expression traits with nominal 
p-value larger than 10 -20 . In both (a) and (b) the solid line is y = x. In (b), the broken 
line fitting the data is obtained by median regression for those 359 genes with nominal 
p-values between 1CP 10 and 1CP 3 . 



the permutation p-value estimate might be inaccurate. Thus, we used the 
359 gene expression traits with nominal p-value between 10~ 10 and 10~ 3 to 
fit the linear pattern (in fact, using all the 483 gene expression traits with 
nominal p-values larger than 10 -20 yielded similar results, data not shown). 
The fitted linear relation is y = 2.52 + 0.978x. Note x and y are in log scale. 
In terms of the p-values, the relation is q = r/p K = 327. 5p° 8 , where p and 
q indicate nominal p-value and permutation p-value, respectively. If k = 1, 
q = r/p, and i] can be interpreted as the effective number of independent tests 
(or the effective number of independent genotype profiles). However, the 
observation that k is close to but smaller than 1 (lower bound 0.960, upper 
bound 0.985) implies that the effective number of independent tests, which 
can be approximated by q/p = r]p K ~ l = ^p-0- 022 ; varies according to the 
nominal p-value p. For example, for p = 10 -3 and 10 -6 , the expected effective 
number of independent tests is approximately 381 and 444, respectively. 

The relation between the effective number of independent tests and the 
significance level can be explained by the geometric interpretation of per- 
mutation p-values. Given a nominal p-value cutoff, whether two genotype 
profiles correspond to two independent tests amounts to whether they can be 
covered by the same significance set. As the p-value cutoff becomes smaller, 
the significance set becomes smaller and, thus, the chance that two genotype 
profiles belong to one significance set is smaller. Therefore, smaller p-value 
cutoff corresponds to more independent tests. 
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4. Discussion. In this paper we have proposed a geometric interpretation 
of permutation p-values and a method to estimate permutation p-values 
based on this interpretation. Both theoretical and empirical results show 
that our method can estimate permutation p- values reliably, except for those 
extremely small or relatively large ones. The extremely small permutation 
p- values correspond to even smaller nominal p- values, for example, 10~ 20 . 
They indicate significant linkages/associations even after Bonferroni correc- 
tion; therefore, permutation p- value evaluation is not needed. The relatively 
large permutation p-values, for example, those larger than 0.1, can be esti- 
mated by a small number of permutations, although in practice such non- 
significant cases may be of little interest. The major computational advan- 
tage of our method is that the computational time is constant regardless of 
the significance level. This computational advantage enables a study of the 
relation between nominal p- values and permutation p- values in a wide range. 
We find that the effective number of independent tests is not a constant; it 
increases as the nominal p- value cutoff becomes smaller. This interesting ob- 
servation can be explained by the geometric interpretation of permutation 
p-values and can provide important guidance in designing new studies. 

Parallel computation is often used to improve the computational efficiency 
by distributing computation to multiple processors/computers. Both direct 
permutation and our estimation method can be implemented for parallel 
computation. In the studies involving a large number of traits (e.g., eQTL 
studies) , one can simply distribute an equal number of traits to each proces- 
sor. If there are only one or a few traits of interest, for direct permutation, 
one can distribute an equal number of permutations to each processor. For 
our estimation method, the most computationally demanding part (which 
takes more than 80% of the computational time) is to estimate P(r,a), 
which can be paralleled by estimating P(r,a) for different r's separately. 
Furthermore, for a particular r, P(r,a) is estimated by evaluating the nom- 
inal p-values for a large number of genotype profiles whose distances to the 
best partition are r. The computation can be further paralleled by evaluating 
nominal p-values for a subset of such genotype profiles in each processor. 

As we mentioned at the beginning of this paper, we focus on the genetic 
studies with high density markers, where the test statistics are evaluated 
on each of the genetic markers directly. Our permutation p- value estimation 
method cannot be directly applied to interval mapping [Lander and Bot- 
stein (1989), Zeng (1993)]. However, we believe that as the expense of SNP 
genotype array decreases, most genetic studies will utilize high density SNP 
arrays. In such situations, the interval mapping may be no longer necessary. 

We have discussed how to estimate the permutation p- value of the most 
significant linkage/association. Permutation p- values can also be used to 
assess the significance of each locus in multiple loci mapping. Doerge and 
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Churchill (1996) have proposed two permutation-based thresholds for mul- 
tiple loci mapping, namely, the conditional empirical threshold (CET) and 
residual empirical threshold (RET). Suppose k markers have been included 
in the genetic model, and we want to test the significance of the (k + l)th 
marker by permutation. The samples can be stratified into 2 k genotype 
classes based on the genotype of the k markers that are already in the model 
(here we still assume genotype is a binary variable). CET is evaluated based 
on permutations within each genotype class. Alternatively, the residuals of 
the fc-marker model can be used to test the significance of the (k + l)th 
marker. RET is calculated by permuting the residuals across the individu- 
als. RET is more powerful than CET when the genetic model is correct since 
the permutations in RET are not restricted by the 2 k stratifications. Our 
permutation p-value estimation method can be applied to RET estimation 
without any modification, and it can also be used to estimate CET with 
some minor modifications. Specifically, let conditional desired partitions be 
the desired partitions that can be generated by the conditional permuta- 
tions. Then in equation (2.5), N p should be calculated as the number of 
conditional desired partitions instead of the total number of desired parti- 
tions. In equation (2.6), P(r,a) remains the same and Cjj(r) needs to be 
calculated by counting the number of conditional desired partitions within 
distance r from at least one of the observed genotype profiles. 

There are some limitations in the current implementation of our method, 
which are also the directions of our future developments. First, we only dis- 
cuss binary markers in this paper. The counting procedures in Propositions 
4 and 5 (see Section IV in the Supplementary Materials [Sun and Wright 
(2009)]) can be extended in a straightforward way to apply to the geno- 
types with three levels. However, some practical considerations need to be 
addressed carefully, for example, the definition of the distance between geno- 
type profiles and the choice of the best partition. Second, the serial counting 
approximation relies on the assumption that the correlated genotype profiles 
are close to each other. This is true for genotype data in linkage studies, but 
in general is not true for association studies, where the proximity of corre- 
lated markers in haplotype blocks may be too coarse for immediate use. We 
are investigating a clustering algorithm to reorder the genotype profiles ac- 
cording to correlation rather than physical proximity. Finally, our work here 
points toward extensions to the use of continuous covariates, which can be 
applied, for example, to map gene expression traits to the raw measurements 
of copy number variations [Stranger et al. (2007)]. 

Acknowledgments. We appreciate the constructive and insightful com- 
ments from the editors and the anonymous reviewers, which significantly 
improved this paper. We acknowledge funding from EPA RD833825. How- 
ever, the research described in this article was not subjected to the Agency's 



20 



W. SUN AND F. A. WRIGHT 



peer review and policy review and therefore does not necessarily reflect the 
views of the Agency and no official endorsement should be inferred. 

SUPPLEMENTARY MATERIAL 

Supplementary Methods and Results for "A geometric interpretation 
of the permutation p-value and its application in eQTL studies" (DOI: 
10.1214/09- AOAS298SUPP; .pdf). The Supplementary Methods and Re- 
sults include four sections: (1) Single marker analysis and the choice of "best 
partition," (2) Description of genotype data, (3) Justification of the hyper- 
sphere assumption for the balanced binary trait, and (4) Propositions and 
the proofs. 
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